According to the American Time Use Survey (ATUS, https://www.bls.gov/charts/american-time-use/activity-by-work.htm), 85.8 percent of males and 66.5 percent of females in the U.S. work more than 40 hours per week. Working overtime has been associated with diverse health problems, especially ischemic heart disease and cerebrovascular disease (Nishiyama & Johnson, 1997). This project therefore examines whether working overtime is associated with blood pressure level using data from the National Health and Nutrition Examination Survey (NHANES). To obtain a sufficient estimation sample, the study pools data from three two-year cycles: 2011-2012, 2013-2014, and 2015-2016.
It is well known that various biological and environmental factors influence blood pressure (Lauer, 1993; Papathanasiou, 2015). To control for such potential influences, we systematically reviewed the literature to select the important factors that might affect respondents' blood pressure. The final predictors included in the model, other than working hours, were gender, age, body mass index (BMI), alcohol consumption level, hours of sleep, and history of smoking. These predictors closely match those in other published studies on the association between work hours and blood pressure (Hayashi, 1996; Nakamura et al., 2012).
In terms of the key variables of interest, working overtime is defined as working more than 40 hours per week. For the additional logistic regression analyses, abnormal blood pressure is defined, following the guideline from the American Heart Association, as systolic blood pressure higher than 140 mm Hg or diastolic blood pressure higher than 90 mm Hg.
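These definitions can be expressed directly as indicator variables. Below is a minimal Python sketch; the column names are illustrative rather than actual NHANES field names:

```python
import pandas as pd

# Hypothetical example rows; column names are illustrative, not NHANES field names.
df = pd.DataFrame({
    "work_hrs": [38, 45, 52],
    "sys_bp": [118, 142, 135],
    "dia_bp": [76, 92, 88],
})

# Overtime: more than 40 hours of work per week.
df["overtime"] = (df["work_hrs"] > 40).astype(int)

# Abnormal blood pressure: systolic > 140 mm Hg or diastolic > 90 mm Hg.
df["abnormal_bp"] = ((df["sys_bp"] > 140) | (df["dia_bp"] > 90)).astype(int)

print(df[["overtime", "abnormal_bp"]].values.tolist())
```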
The study population consists of NHANES participants aged 18-60 who were employed, drawn from three two-year cycles: 2011-2012, 2013-2014, and 2015-2016. The 2011-2012 cycle contains 2,495 such participants, the 2013-2014 cycle 2,819, and the 2015-2016 cycle 2,799. For the purpose of this study, we exclude people currently taking prescription medication for hypertension. This exclusion reduces the sample substantially: 100 observations remain for 2011-2012, 95 for 2013-2014, and 140 for 2015-2016.
From the 335 remaining observations, we selected variables through literature review and group discussion. Dataset names differ across the two-year cycles by a suffix: G for 2011-2012, H for 2013-2014, and I for 2015-2016. For convenience, the suffix is written as * below. The variables chosen for this project are listed below:
The measurements of blood pressure: systolic and diastolic blood pressure from the BPX_* dataset; each is measured in three trials, sometimes four,
Working hours in the last week from the OCQ_* dataset,
Body mass index from the BMX_* dataset,
Demographic data: age and gender from the DEMO_* dataset,
Alcohol use in the last 12 months: ALQ120Q for the quantity and ALQ120U to distinguish the unit (week, month, or year), from the ALQ_* dataset,
Smoked tobacco in the last five days: SMQ681 in 2013-2016 and SMQ680 in 2011-2012, from the SMQRTU_* dataset,
Sleeping hours on working days: SLD012 in 2015-2016 and SLD010H otherwise, from the SLQ_* dataset.
The cleaning process involves excluding missing values for a complete-case analysis and excluding cases with more than two missing blood pressure trial measurements. After cleaning, the total sample size for this study is 238.
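The two cleaning steps can be sketched in Python as follows; the toy data frame assumes three systolic trial columns and one covariate, with hypothetical names:

```python
import numpy as np
import pandas as pd

# Toy frame; the trial columns and covariate name are illustrative.
df = pd.DataFrame({
    "sys1": [120, np.nan, 118, np.nan],
    "sys2": [122, np.nan, np.nan, 130],
    "sys3": [119, np.nan, np.nan, 128],
    "age":  [34.0, 41.0, 29.0, np.nan],
})

trial_cols = ["sys1", "sys2", "sys3"]

# Drop cases with more than two missing blood pressure trial measurements.
df = df[df[trial_cols].isna().sum(axis=1) <= 2]

# Complete-case analysis on the remaining covariates.
df = df.dropna(subset=["age"])
print(len(df))
```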
This study regresses each type of blood pressure (systolic and diastolic) on working hours, controlling for all other variables, to assess whether working overtime is associated with higher blood pressure. Coefficients and their p-values for statistical significance are reported.
The response variable, blood pressure, is the average of the pressures measured over three trials (four if one earlier trial is missing). The working hours variable is encoded as a dichotomous variable, where 1 stands for working more than 40 hours and 0 otherwise. Alcohol drinks are converted to average drinks per week: values reported per month are divided by 4.345 and values reported per year by 52.143, with three decimal places preserved. Because the sleep variable is encoded differently in 2015-2016 (a value of 14 means 14 or more hours, whereas the cap is 12 in 2011-2014), we recode all sleep values above 12 to 12 for consistency across the pooled sample.
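The recodings described above can be sketched as small helper functions; the function names are illustrative, and the conversion constants are those stated in the text:

```python
# Sketch of the encodings described above; names are illustrative.
def drinks_per_week(qty, unit):
    """Convert a reported drinking quantity to average drinks per week."""
    divisor = {"week": 1.0, "month": 4.345, "year": 52.143}[unit]
    return round(qty / divisor, 3)

def recode_sleep(hours):
    """Cap sleep hours at 12 to match the 2011-2014 coding."""
    return min(hours, 12)

def overtime_flag(work_hrs):
    """1 if working more than 40 hours per week, else 0."""
    return int(work_hrs > 40)

print(drinks_per_week(12, "month"))  # monthly count scaled to weekly
print(recode_sleep(14))              # the 2015-2016 "14 or more" code becomes 12
print(overtime_flag(45))
```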
The additional analysis adds interactions and transformations to the classical linear model. Logistic regression is applied to examine the associations between the predictors and the risk of abnormal blood pressure.
We use the data.table and dplyr packages in R, together with Python, for the data preparation process. The multiple regression is implemented in Python and with the lm and gls functions in R.
After data cleaning, we first visualized the data with matrix plots. As the matrix plots for systolic blood pressure show, the correlations between the variables are weak. For diastolic blood pressure, the matrix plots likewise show weak or almost no relations between the variables. Moreover, the histograms show that a few variables were not normally distributed, but most, including systolic and diastolic blood pressure, were approximately normal.
pairs( ~avg_sys_bp + gender + age + bmi + sleep + smoke + workhrs +
alchol, data = analysis_dt, main = "simple matrix plot for Systolic blood pressure")
pairs( ~avg_dia_bp + gender + age + bmi + sleep + smoke + workhrs +
alchol, data = analysis_dt, main = "simple matrix plot for Diastolic blood pressure")
par( mfrow=c(3,2) )
hist(analysis_dt$avg_sys_bp)
hist(analysis_dt$avg_dia_bp)
hist(analysis_dt$age)
hist(analysis_dt$bmi)
hist(analysis_dt$sleep)
hist(analysis_dt$alchol)
After the preliminary analysis on the data, we fitted our linear regression models.
For the systolic blood pressure, we fitted the following model:
\[ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} \]
For diastolic blood pressure, we fitted the following model:
\[
\begin{aligned}
y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\
&+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon
\end{aligned}
\]
## systolic bp
pre_sys_fit = lm(avg_sys_bp ~ gender + age + bmi + sleep + smoke + workhrs + alchol,
data = analysis_dt)
summary(pre_sys_fit)
##
## Call:
## lm(formula = avg_sys_bp ~ gender + age + bmi + sleep + smoke +
## workhrs + alchol, data = analysis_dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.867 -12.271 -2.392 10.500 76.693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.29276 11.82799 7.634 6.05e-13 ***
## gender -3.62489 2.74377 -1.321 0.187770
## age 0.47042 0.12723 3.697 0.000273 ***
## bmi 0.64716 0.20248 3.196 0.001588 **
## sleep 0.49185 0.98005 0.502 0.616243
## smoke 1.19416 2.65097 0.450 0.652802
## workhrs 1.57530 2.67687 0.588 0.556784
## alchol 0.01928 0.71959 0.027 0.978645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.21 on 230 degrees of freedom
## Multiple R-squared: 0.08855, Adjusted R-squared: 0.06081
## F-statistic: 3.192 on 7 and 230 DF, p-value: 0.003031
## diastolic bp
pre_dia_fit = lm(avg_dia_bp ~ gender + age + sleep + smoke + workhrs + alchol,
data = analysis_dt)
summary(pre_dia_fit)
##
## Call:
## lm(formula = avg_dia_bp ~ gender + age + sleep + smoke + workhrs +
## alchol, data = analysis_dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.710 -8.225 0.728 8.255 33.789
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 69.90643 6.50533 10.746 <2e-16 ***
## gender -2.77089 1.93383 -1.433 0.1533
## age 0.22960 0.09145 2.511 0.0127 *
## sleep 0.07160 0.70738 0.101 0.9195
## smoke -0.04585 1.92066 -0.024 0.9810
## workhrs 0.50563 1.94090 0.261 0.7947
## alchol -0.12419 0.51884 -0.239 0.8110
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.94 on 231 degrees of freedom
## Multiple R-squared: 0.03657, Adjusted R-squared: 0.01154
## F-statistic: 1.461 on 6 and 231 DF, p-value: 0.1924
From the summary results of the linear regression model for systolic blood pressure, the coefficient of working overtime is positive, consistent with a higher systolic blood pressure level among overtime workers; however, the effect is not significant (\(\beta = 1.58\), p = 0.557). We therefore conclude that working overtime is not significantly associated with systolic blood pressure. Similarly, for diastolic blood pressure the coefficient of working overtime is positive but again not significant (\(\beta = 0.51\), p = 0.795), so working overtime is not significantly associated with diastolic blood pressure either.
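As noted earlier, the regression was also run in Python. A sketch of the same specification with statsmodels follows; the data are randomly generated stand-ins for analysis_dt, so the estimates are illustrative only:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 238  # same size as the analysis sample

# Synthetic stand-in for analysis_dt; variable names follow the R code.
d = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "age": rng.integers(18, 61, n),
    "bmi": rng.normal(27, 5, n),
    "sleep": rng.integers(4, 13, n),
    "smoke": rng.integers(0, 2, n),
    "workhrs": rng.integers(0, 2, n),
    "alchol": rng.exponential(2, n),
})
d["avg_sys_bp"] = 90 + 0.5 * d["age"] + 0.6 * d["bmi"] + rng.normal(0, 19, n)

# Same specification as the R call lm(avg_sys_bp ~ gender + age + ...).
fit = smf.ols("avg_sys_bp ~ gender + age + bmi + sleep + smoke + workhrs + alchol",
              data=d).fit()
print(fit.params["workhrs"], fit.pvalues["workhrs"])
```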
After fitting the linear regression models, we checked whether the following assumptions of linear regression hold:
1. Homoscedasticity: the variance of the residuals is constant.
2. Linearity: the relationship between the independent and dependent variables is linear.
3. No or little multicollinearity: there is no or little collinearity between the independent variables.
1. Homoscedasticity
In order to check the homoscedasticity assumption, we plotted standardized residuals against fitted values plots for both systolic blood pressure and diastolic blood pressure.
par( mfrow=c(1,2) )
## standardized residuals plot for systolic bp
st_sys_red = rstandard(pre_sys_fit)
fitted_sys = pre_sys_fit$fitted.values
plot(fitted_sys, st_sys_red, xlab = "fitted values",
ylab = "standardized residuals",
main = "standardized residuals plot (systolic bp)")
## standardized residuals plot for diastolic bp
st_dia_red = rstandard(pre_dia_fit)
fitted_dia = pre_dia_fit$fitted.values
plot(fitted_dia, st_dia_red, xlab = "fitted values",
ylab = "standardized residuals",
main = "standardized residuals plot (diastolic bp)")
From the standardized residuals plot for systolic blood pressure, the mean of the residuals is roughly 0 and the variance is roughly constant. The same holds for diastolic blood pressure, so the homoscedasticity assumption can be considered satisfied.
2. No or little multicollinearity
In order to check for multicollinearity, we examined the covariance matrix of the coefficient estimates and conducted chi-squared tests of independence between each pair of binary independent variables.
cov_mat = vcov(pre_sys_fit)
cov_mat[-1, -1]
## gender age bmi sleep smoke
## gender 7.52828033 -0.022099117 -0.13212674 -0.280415922 0.43262585
## age -0.02209912 0.016188377 0.00353704 0.007186922 0.04133382
## bmi -0.13212674 0.003537040 0.04099638 0.020411778 0.02974110
## sleep -0.28041592 0.007186922 0.02041178 0.960493109 0.07954100
## smoke 0.43262585 0.041333816 0.02974110 0.079541003 7.02762846
## workhrs 0.78274008 -0.024336217 -0.02142290 -0.109445473 0.80275098
## alchol 0.47160755 -0.011946925 0.01639422 -0.047810983 -0.34321186
## workhrs alchol
## gender 0.78274008 0.47160755
## age -0.02433622 -0.01194692
## bmi -0.02142290 0.01639422
## sleep -0.10944547 -0.04781098
## smoke 0.80275098 -0.34321186
## workhrs 7.16563783 0.05594737
## alchol 0.05594737 0.51781435
## check correlation between indicators
gs = chisq.test(analysis_dt$gender, analysis_dt$smoke)
gw = chisq.test(analysis_dt$gender, analysis_dt$workhrs)
sw = chisq.test(analysis_dt$smoke, analysis_dt$workhrs)
## make tables
row = c("gender", "gender", "smoke")
col = c("smoke", "workhrs", "workhrs")
chi_sq_stats = c( gs$statistic, gw$statistic, sw$statistic )
p_value = c(gs$p.value, gw$p.value, sw$p.value)
m = as.data.table(cbind(row, col, chi_sq_stats, p_value))
m
## row col chi_sq_stats p_value
## 1: gender smoke 3.15347205364872 0.0757655899731777
## 2: gender workhrs 1.10585776488131 0.292984154257689
## 3: smoke workhrs 2.92869646016623 0.0870177267808233
As the coefficient covariance matrix shows, the off-diagonal entries are small, suggesting little collinearity among the independent variables. Likewise, the chi-squared test results show no significant association between any pair of binary variables at the 0.05 level.
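For reference, the same chi-squared test with Yates' continuity correction (R's chisq.test default for 2x2 tables) is available in Python via scipy; the counts below are hypothetical, since the raw contingency tables are not printed above:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of gender by smoking status (counts are illustrative).
table = np.array([[60, 55],
                  [45, 78]])

# correction=True applies Yates' continuity correction, matching R's default
# behaviour for 2x2 tables in chisq.test.
stat, p, dof, expected = chi2_contingency(table, correction=True)
print(round(stat, 3), round(p, 4), dof)
```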
3. Linearity
In order to check the linearity assumption, we plotted partial regression plots for both systolic blood pressure and diastolic pressure against each of the independent variables.
avPlots(pre_sys_fit)
avPlots(pre_dia_fit)
From the partial regression plots for systolic blood pressure, the expected value of systolic blood pressure is approximately a linear function of each independent variable after controlling for all the others. The plots for diastolic blood pressure lead to the same conclusion, so the linearity assumption can be considered satisfied.
After examining the multiple linear regression in the core analysis, I am interested in a more complicated model with interactions and transformations that adjust for non-linearity and non-constant variance.
The following model takes the logarithm of the average systolic blood pressure. The variables alchol and sleep enter through basis-spline functions with five degrees of freedom (cubic splines with two knots). Interaction effects between workhrs and all other variables are included. The estimates and significance levels are reported below:
fit_sys = lm( log(avg_sys_bp) ~ workhrs * (age + gender + bmi + bs(alchol, 5) +
bs(sleep, 5) + smoke) + smoke:alchol,
data = dt)
summary(fit_sys)
##
## Call:
## lm(formula = log(avg_sys_bp) ~ workhrs * (age + gender + bmi +
## bs(alchol, 5) + bs(sleep, 5) + smoke) + smoke:alchol, data = dt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.38186 -0.07910 -0.00831 0.06629 0.43370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4731977 0.1079382 41.442 < 2e-16 ***
## workhrs -0.1317869 0.2596297 -0.508 0.612277
## age 0.0041774 0.0011083 3.769 0.000213 ***
## gender -0.0439523 0.0252797 -1.739 0.083585 .
## bmi 0.0054177 0.0017799 3.044 0.002639 **
## bs(alchol, 5)1 0.1191565 0.0617548 1.930 0.055035 .
## bs(alchol, 5)2 0.0299150 0.0536695 0.557 0.577861
## bs(alchol, 5)3 0.2629379 0.1224777 2.147 0.032970 *
## bs(alchol, 5)4 0.0177825 0.1344588 0.132 0.894913
## bs(alchol, 5)5 0.0317093 0.0789881 0.401 0.688508
## bs(sleep, 5)1 -0.1787615 0.0981858 -1.821 0.070105 .
## bs(sleep, 5)2 0.1319643 0.0798247 1.653 0.099811 .
## bs(sleep, 5)3 -0.2097746 0.0955405 -2.196 0.029227 *
## bs(sleep, 5)4 0.2610917 0.1322199 1.975 0.049634 *
## bs(sleep, 5)5 -0.2125092 0.1311169 -1.621 0.106591
## smoke 0.0260160 0.0275418 0.945 0.345964
## workhrs:age -0.0018738 0.0020102 -0.932 0.352342
## workhrs:gender 0.0469979 0.0421262 1.116 0.265867
## workhrs:bmi 0.0009652 0.0031768 0.304 0.761566
## workhrs:bs(alchol, 5)1 -0.0850794 0.1149464 -0.740 0.460038
## workhrs:bs(alchol, 5)2 -0.1035043 0.0948057 -1.092 0.276211
## workhrs:bs(alchol, 5)3 -0.1711782 0.2228871 -0.768 0.443360
## workhrs:bs(alchol, 5)4 -0.2518054 0.2453225 -1.026 0.305889
## workhrs:bs(alchol, 5)5 0.0244100 0.1585296 0.154 0.877777
## workhrs:bs(sleep, 5)1 0.4848052 0.2807767 1.727 0.085721 .
## workhrs:bs(sleep, 5)2 0.0927614 0.2088909 0.444 0.657459
## workhrs:bs(sleep, 5)3 0.6274614 0.2494358 2.516 0.012645 *
## workhrs:bs(sleep, 5)4 -0.2045744 0.3723757 -0.549 0.583339
## workhrs:bs(sleep, 5)5 1.3149068 0.6421546 2.048 0.041857 *
## workhrs:smoke -0.1009870 0.0415018 -2.433 0.015810 *
## smoke:alchol 0.0085202 0.0106651 0.799 0.425269
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 207 degrees of freedom
## Multiple R-squared: 0.2317, Adjusted R-squared: 0.1204
## F-statistic: 2.081 on 30 and 207 DF, p-value: 0.00154
AIC(fit_sys)
## [1] -252.4279
The model has an R-squared of 0.2317. The F-test against the null model yields a statistic of 2.081 on 30 and 207 degrees of freedom with a p-value of 0.00154, indicating that the model is statistically significant at the 0.05 level. The AIC of this model is -252.428.
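A structurally equivalent model can be sketched in Python, where patsy's bs() inside a statsmodels formula plays the role of R's splines::bs(). The data below are synthetic, so only the design structure (31 coefficients, hence 30 model degrees of freedom on 238 observations) matches the fit above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 238
d = pd.DataFrame({
    "workhrs": rng.integers(0, 2, n),
    "age": rng.integers(18, 61, n),
    "gender": rng.integers(0, 2, n),
    "bmi": rng.normal(27, 5, n),
    "alchol": rng.exponential(2, n),
    "sleep": rng.integers(4, 13, n).astype(float),
    "smoke": rng.integers(0, 2, n),
})
d["avg_sys_bp"] = np.exp(4.5 + 0.004 * d["age"] + rng.normal(0, 0.13, n))

# patsy's bs() mirrors R's splines::bs(); df=5 matches bs(alchol, 5) in the R call.
formula = ("np.log(avg_sys_bp) ~ workhrs * (age + gender + bmi + bs(alchol, df=5)"
           " + bs(sleep, df=5) + smoke) + smoke:alchol")
fit = smf.ols(formula, data=d).fit()
print(int(fit.df_model), int(fit.df_resid))
```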
vif(fit_sys)
## GVIF Df GVIF^(1/(2*Df))
## workhrs 200.980384 1 14.176755
## age 1.662069 1 1.289213
## gender 2.074395 1 1.440276
## bmi 1.823955 1 1.350539
## bs(alchol, 5) 25.003111 5 1.379747
## bs(sleep, 5) 6.382317 5 1.203643
## smoke 2.421995 1 1.556276
## workhrs:age 25.662396 1 5.065807
## workhrs:gender 2.537148 1 1.592843
## workhrs:bmi 32.073578 1 5.663354
## workhrs:bs(alchol, 5) 223.496453 5 1.717620
## workhrs:bs(sleep, 5) 584.091711 5 1.890811
## workhrs:smoke 2.163470 1 1.470874
## smoke:alchol 4.161601 1 2.040000
When checking the variance inflation factors, I find that workhrs, workhrs:age, and workhrs:bmi are greatly inflated (GVIF above 5). This suggests collinearity that may destabilize the coefficient estimates, although some inflation between interaction terms and their main effects is expected by construction.
par( mfrow = c(2,2) )
plot(Effect("age", fit_sys, partial.residuals = TRUE))
plot(Effect("bmi", fit_sys, partial.residuals = TRUE))
plot(Effect("alchol", fit_sys, partial.residuals = TRUE))
plot(Effect("sleep", fit_sys, partial.residuals = TRUE))
The plots above examine the effect of each continuous variable in the model on the response. The blue lines show the relationship implied by the model; the red dots are partial residuals, and the red line shows how each variable correlates with the response after controlling for all other variables. The model does quite well at explaining the correlation for age, bmi, and alchol. The basis-spline function of sleep, however, becomes very unstable at the boundary points, which might suggest extrapolation bias.
The coefficients of this model are very hard to interpret, even though the model roughly satisfies the linear-modeling assumptions. Still, the model can answer several research questions. In the following example, I examine whether sleeping 5 hours is associated with significantly different blood pressure than sleeping 8 hours among people who work overtime.
Answering this question directly from the model requires accounting for the interaction terms involving the variables of interest, and the coefficients of the basis-spline transformation of sleep are meaningful mathematically rather than practically.
The code below creates two fake datasets based on the original data: one with sleep set to 8 for everyone and one with sleep set to 5, both with workhrs set to 1. After stacking the original dataset with the two fake datasets, we assign weight 1 to all original observations and weight 0 to all fake observations. The zero weights reproduce the coefficients of the previously fitted model while yielding a design matrix with the same structure (same interactions and transformations) evaluated at the fake data. Applying the model coefficients to these rows then gives the contrast implied by the test assumptions.
# is the difference between working overtime with 8 hours of sleep and working
# overtime with 5 hours of sleep significant?
dt = dt[, wgt := 1]
d_fake_8hrs = copy(dt)
d_fake_8hrs[, `:=` (workhrs = 1, sleep = 8, wgt = 0)] # make fake data with contrast
d_fake_5hrs = copy(dt)
d_fake_5hrs[, `:=` (workhrs = 1, sleep = 5, wgt = 0)] # make fake data with contrast
dx = rbindlist( list(dt, d_fake_8hrs, d_fake_5hrs) ) # combine true data with fake data
result = lm( log(avg_sys_bp) ~ workhrs * (age + gender + bmi + bs(alchol, 5) +
bs(sleep, 5) + smoke) + smoke:alchol,
weights = wgt, data = dx) # fit regression with zero weights on fake data
pa = coef(result) # parameters
cm = vcov(result) # covariance matrix
dm = model.matrix(result) # design matrix
nx = 238
ct = dm[ (nx+1):(nx+2*nx), ]
ct = ct[1:nx,] - ct[(nx+1):(2*nx), ] # get the contrast
znum = ct %*% pa # numerator of z-score
zdenom = sqrt(diag(ct %*% cm %*% t(ct))) # denominator of z-score
zscores = znum / zdenom
z_critical = qnorm(0.975)
{mean(zscores) > z_critical}
## [1] FALSE
The mean z-score is 0.864, which is less than 1.96. We fail to reject the null hypothesis of no difference in blood pressure between sleeping 5 hours and sleeping 8 hours among people who work overtime.
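Stripped of the weighting trick, the test above is a delta-method contrast: z = (C beta) / sqrt(diag(C V C')). A generic numpy sketch with toy coefficient and covariance values (not the fitted model's):

```python
import numpy as np

# Toy fitted coefficients and their covariance matrix (illustrative values).
beta = np.array([4.5, 0.02, -0.01])
V = np.diag([0.04, 0.001, 0.002])

# Each row of C is a difference of two design rows, i.e. one contrast.
C = np.array([[0.0, 3.0, 0.0],
              [0.0, 3.0, 1.0]])

# z = (C beta) / sqrt(diag(C V C')): estimate of the contrast over its
# standard error, using the delta method.
z = (C @ beta) / np.sqrt(np.diag(C @ V @ C.T))
significant = np.abs(z) > 1.96  # two-sided 5% criterion
print(z.round(3), significant)
```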
We begin again by visualizing the cleaned data, this time with histograms of each variable (including the binary indicators) and matrix plots for both outcomes. The histograms show that most variables, including systolic and diastolic blood pressure, are approximately normally distributed, while the matrix plots show weak or almost no correlations between the variables.
#test the assumptions for the linear regression
#assumption1) normality assumption: check if the continuous variables are normal
par( mfrow=c(3,3) )
#DVs
hist(df$SBP)
hist(df$DBP)
#IVs
hist(df$age)
hist(df$bmi)
hist(df$sleep)
hist(df$alcohol_adj)
#check the distributions
hist(df$overtime)
hist(df$SBP_bi)
hist(df$DBP_bi)
#for SBP
pairs(SBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
#for DBP
pairs(DBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
We wanted to check whether the following assumptions of the linear regression model hold:
1. Homoscedasticity: the variance of the residuals is constant.
2. Linearity: the relationship between the independent and dependent variables is linear.
3. No or little multicollinearity: there is no or little collinearity between the independent variables.
1. Linearity
In order to check the linearity assumption, we plotted partial regression plots for both systolic blood pressure and diastolic pressure against each of the independent variables.
#assumption3) linearity
#first fit a linear regression with all covariates
avplot_SBP = lm(SBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
avplot_DBP = lm(DBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
#then see the added variable plots
car::avPlots(avplot_SBP) #for SBP
car::avPlots(avplot_DBP) #for DBP
From the partial regression plots for systolic blood pressure, the expected value of systolic blood pressure is approximately a linear function of each independent variable after controlling for all the others. The plots for diastolic blood pressure lead to the same conclusion, so the linearity assumption can be considered satisfied.
2. Homoscedasticity
In order to check the homoscedasticity assumption, we plotted standardized residuals against fitted values plots for both systolic blood pressure and diastolic blood pressure.
#standardized residuals plot
#SBP
SBP_red = rstandard(avplot_SBP)
fitted_SBP = avplot_SBP$fitted.values
plot(fitted_SBP, SBP_red, xlab = "Fitted Values",
ylab = "Standardized Residuals",
main = "Standardized Residuals Plot (SBP)")
#DBP
DBP_red = rstandard(avplot_DBP)
fitted_DBP = avplot_DBP$fitted.values
plot(fitted_DBP, DBP_red, xlab = "Fitted Values",
ylab = "Standardized Residuals",
main = "Standardized Residuals Plot (DBP)")
From the standardized residuals plot for systolic blood pressure, the mean of the residuals is roughly 0 and the variance is roughly constant. The same holds for diastolic blood pressure, so the homoscedasticity assumption can be considered satisfied.
3. No or little multicollinearity
In order to check for multicollinearity, we computed Pearson correlations between each pair of continuous independent variables and conducted chi-squared tests of independence between each pair of binary independent variables.
#collinearity for continuous variables
cont_df = df %>%
select(age, bmi, alcohol_adj, sleep)
cor(cont_df)
## age bmi alcohol_adj sleep
## age 1.00000000 -0.13615718 0.11946370 -0.02952827
## bmi -0.13615718 1.00000000 -0.21568224 -0.08195345
## alcohol_adj 0.11946370 -0.21568224 1.00000000 0.04966028
## sleep -0.02952827 -0.08195345 0.04966028 1.00000000
#correlation coefficients have small values
#collinearity for categorical variables
cate_df = df %>%
select(gender, smoke, overtime)
chisq.test(cate_df$gender, cate_df$smoke)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cate_df$gender and cate_df$smoke
## X-squared = 3.1535, df = 1, p-value = 0.07577
chisq.test(cate_df$gender, cate_df$overtime)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cate_df$gender and cate_df$overtime
## X-squared = 1.1059, df = 1, p-value = 0.293
chisq.test(cate_df$smoke, cate_df$overtime)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: cate_df$smoke and cate_df$overtime
## X-squared = 2.9287, df = 1, p-value = 0.08702
#no significant associations
As the correlation table shows, the correlations between each pair of continuous independent variables are very small. Likewise, the chi-squared test results indicate little association between each pair of binary variables.
Here, we fit our linear regression models.
For the systolic blood pressure, we fit the model:
\[
\begin{aligned}
y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\
&+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon
\end{aligned}
\]
And for diastolic blood pressure, we fit the model: \[
\begin{aligned}
y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\
&+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon
\end{aligned}
\]
#core analysis ----------------------------------------------
#linear regression with "overtime" as the main predictor
model_SBP = gls(SBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
summary(model_SBP)
## Generalized least squares fit by REML
## Model: SBP ~ overtime + gender + age + bmi + alcohol_adj + sleep + smoke
## Data: df
## AIC BIC logLik
## 2079.396 2110.338 -1030.698
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 91.48710 11.615224 7.876482 0.0000
## overtime 1.57529 2.676871 0.588481 0.5568
## genderfemale -3.62499 2.743774 -1.321168 0.1878
## age 0.47042 0.127233 3.697308 0.0003
## bmi 0.64715 0.202476 3.196195 0.0016
## alcohol_adj 0.01917 0.719598 0.026644 0.9788
## sleep 0.49186 0.980048 0.501874 0.6162
## smokeNotSmoked -1.19423 2.650964 -0.450489 0.6528
##
## Correlation:
## (Intr) overtm gndrfm age bmi alchl_ sleep
## overtime 0.005
## genderfemale 0.102 0.107
## age -0.544 -0.071 -0.063
## bmi -0.652 -0.040 -0.238 0.137
## alcohol_adj -0.101 0.029 0.239 -0.130 0.113
## sleep -0.637 -0.042 -0.104 0.058 0.103 -0.068
## smokeNotSmoked -0.033 -0.113 -0.059 -0.123 -0.055 0.180 -0.031
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.2317121 -0.6388468 -0.1245573 0.5466566 3.9927520
##
## Residual standard error: 19.20794
## Degrees of freedom: 238 total; 230 residual
model_DBP = gls(DBP ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke, data = df)
summary(model_DBP)
## Generalized least squares fit by REML
## Model: DBP ~ overtime + gender + age + bmi + alcohol_adj + sleep + smoke
## Data: df
## AIC BIC logLik
## 1919.121 1950.063 -950.5603
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 49.75757 8.198020 6.069462 0.0000
## overtime 0.22461 1.889335 0.118884 0.9055
## genderfemale -4.50410 1.936555 -2.325832 0.0209
## age 0.27600 0.089801 3.073463 0.0024
## bmi 0.53780 0.142907 3.763260 0.0002
## alcohol_adj 0.09092 0.507892 0.179023 0.8581
## sleep 0.33936 0.691717 0.490607 0.6242
## smokeNotSmoked -0.34427 1.871050 -0.183997 0.8542
##
## Correlation:
## (Intr) overtm gndrfm age bmi alchl_ sleep
## overtime 0.005
## genderfemale 0.102 0.107
## age -0.544 -0.071 -0.063
## bmi -0.652 -0.040 -0.238 0.137
## alcohol_adj -0.101 0.029 0.239 -0.130 0.113
## sleep -0.637 -0.042 -0.104 0.058 0.103 -0.068
## smokeNotSmoked -0.033 -0.113 -0.059 -0.123 -0.055 0.180 -0.031
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -3.43931088 -0.53464833 0.06774935 0.60387772 2.87705186
##
## Residual standard error: 13.55695
## Degrees of freedom: 238 total; 230 residual
From the summary results of the linear regression model for systolic blood pressure, the coefficient of working overtime is positive, consistent with a higher systolic blood pressure level among overtime workers; however, the effect is not significant (\(\beta\) = 1.58, p = 0.557). We therefore conclude that working overtime is not significantly associated with systolic blood pressure. Similarly, for diastolic blood pressure the coefficient of working overtime is positive but again not significant (\(\beta\) = 0.22, p = 0.905), so working overtime is not significantly associated with diastolic blood pressure either.
Here, I fit logistic regressions as an additional analysis to examine the association between the predictors and the risk of abnormal blood pressure. Following the guideline from the American Heart Association, abnormal systolic blood pressure was coded 1 if the respondent's systolic blood pressure was higher than 140 mm Hg, and abnormal diastolic blood pressure was coded 1 if the diastolic blood pressure was higher than 90 mm Hg; respondents below these thresholds were coded 0 to indicate normal blood pressure.
#logistic regression with normal vs. abnormal BP outcome
logistic_SBP <- glm(SBP_bi ~ overtime + gender + age + bmi +
alcohol_adj + sleep + smoke,
family = binomial(link = 'logit'),
data = df)
summary(logistic_SBP)
##
## Call:
## glm(formula = SBP_bi ~ overtime + gender + age + bmi + alcohol_adj +
## sleep + smoke, family = binomial(link = "logit"), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4280 -0.9191 -0.6722 1.1867 2.0208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16441 1.47603 -4.176 2.96e-05 ***
## overtime 0.23889 0.30418 0.785 0.432245
## genderfemale -0.18500 0.31844 -0.581 0.561264
## age 0.05307 0.01587 3.343 0.000828 ***
## bmi 0.06839 0.02381 2.872 0.004075 **
## alcohol_adj 0.06821 0.08061 0.846 0.397494
## sleep 0.12490 0.11583 1.078 0.280894
## smokeNotSmoked -0.08405 0.30729 -0.274 0.784450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 301.1 on 237 degrees of freedom
## Residual deviance: 280.7 on 230 degrees of freedom
## AIC: 296.7
##
## Number of Fisher Scoring iterations: 4
logistic_DBP <- glm(DBP_bi ~ overtime + gender + age + bmi +
                      alcohol_adj + sleep + smoke,
                    family = binomial(link = 'logit'),
                    data = df)
summary(logistic_DBP)
##
## Call:
## glm(formula = DBP_bi ~ overtime + gender + age + bmi + alcohol_adj +
## sleep + smoke, family = binomial(link = "logit"), data = df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0715 -0.6532 -0.4741 -0.3401 2.3798
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.74815 1.83339 -3.681 0.000233 ***
## overtime 0.41364 0.37478 1.104 0.269727
## genderfemale -0.62077 0.41894 -1.482 0.138402
## age 0.05215 0.02025 2.576 0.010003 *
## bmi 0.08619 0.02941 2.931 0.003380 **
## alcohol_adj 0.11436 0.09444 1.211 0.225925
## sleep 0.01283 0.14374 0.089 0.928902
## smokeNotSmoked -0.28536 0.38274 -0.746 0.455936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 215.54 on 237 degrees of freedom
## Residual deviance: 197.34 on 230 degrees of freedom
## AIC: 213.34
##
## Number of Fisher Scoring iterations: 5
For systolic blood pressure, the logistic regression model indicated that working overtime is associated with an increased risk of abnormal systolic blood pressure (\(\beta\) = 0.24, p = 0.432), but this association was not statistically significant. Among the covariates, higher age and higher BMI were significantly associated with an increased risk of abnormal systolic blood pressure.
For diastolic blood pressure, the logistic regression model showed that working overtime is associated with an increased risk of abnormal diastolic blood pressure (\(\beta\) = 0.41, p = 0.270), but this association was not statistically significant. Among the covariates, higher age and higher BMI were significantly associated with an increased risk of abnormal diastolic blood pressure.
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.graphics.regressionplots import plot_partregress_grid
import matplotlib.pyplot as plt
%matplotlib inline
After data cleaning, we visualized our data with the matrix plots first.
pair_plot_data_sys_bp = full_dt[['avg_BPXSY', 'gender', 'age', 'bmi', 'sleep',
                                 'smoke', 'workhrs', 'avg_alcohol_freq_wk']]
ax = pd.plotting.scatter_matrix(pair_plot_data_sys_bp, alpha = 0.5, figsize = (14, 12));
plt.xticks(fontsize = 2)
plt.yticks(fontsize = 2);
pair_plot_data_dia_bp = full_dt[['avg_BPXDI', 'gender', 'age', 'bmi', 'sleep',
                                 'smoke', 'workhrs', 'avg_alcohol_freq_wk']]
pd.plotting.scatter_matrix(pair_plot_data_dia_bp, alpha = 0.5, figsize = (14, 12))
plt.xticks(fontsize = 2)
plt.yticks(fontsize = 2);
As shown in the matrix plots for systolic blood pressure, the correlations between the variables were weak; for diastolic blood pressure, the matrix plots likewise showed weak or nearly no relations between the variables. Moreover, the histograms showed that a few variables deviated from normality, while most variables, including systolic and diastolic blood pressure, were approximately normally distributed.
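The visual impression from the histograms can be supplemented with a formal normality test; below is a minimal sketch using the Shapiro-Wilk test on synthetic stand-in data (the use of scipy here is an assumption, since the analysis itself imports only pandas, numpy, seaborn, statsmodels, and matplotlib; on the real data one would pass a column such as avg_BPXSY instead).

```python
import numpy as np
from scipy import stats

# synthetic stand-in for a roughly normal variable such as systolic blood pressure
rng = np.random.default_rng(0)
sample = rng.normal(loc=120, scale=15, size=200)

# Shapiro-Wilk test: small p-values indicate departure from normality
stat, p = stats.shapiro(sample)
print(stat, p)
```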
After the preliminary analysis on the data, we fitted our linear regression models.
For the systolic blood pressure, we fit the model: $$ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} $$
model_formula_sys_bp = 'avg_BPXSY ~ gender + age + bmi + sleep + smoke + workhrs + avg_alcohol_freq_wk'
model_sys_bp = smf.ols(formula = model_formula_sys_bp, data = full_dt)
model_fit_sys_bp = model_sys_bp.fit()
print(model_fit_sys_bp.summary())
For the diastolic blood pressure, we fitted the same form of model: $$ \begin{aligned} y &= \beta_{0} + \beta_{workhrs}*x_{workhrs} + \beta_{gender}*x_{gender} + \beta_{age}*x_{age} + \beta_{bmi}*x_{bmi} \notag\\ &+ \beta_{sleep}*x_{sleep} + \beta_{smoke}*x_{smoke} + \beta_{avg\_alcohol\_freq\_wk}*x_{avg\_alcohol\_freq\_wk} + \epsilon \end{aligned} $$
model_formula_dia_bp = 'avg_BPXDI ~ gender + age + bmi + sleep + smoke + workhrs + avg_alcohol_freq_wk'
model_dia_bp = smf.ols(formula = model_formula_dia_bp, data = full_dt)
model_fit_dia_bp = model_dia_bp.fit()
print(model_fit_dia_bp.summary())
From the summary of the linear regression model for systolic blood pressure, the coefficient of working hours is positive, which suggests that working overtime increases the systolic blood pressure level; however, this effect is not significant ($\beta = 1.58$, p = 0.557). We therefore conclude that working overtime is not significantly associated with systolic blood pressure. Similarly, for diastolic blood pressure, the coefficient of working hours is positive, but this effect is not significant either ($\beta = 0.2246$, p = 0.905), so working overtime is not significantly associated with diastolic blood pressure.
After fitting the linear regression models, we checked whether the assumptions of linear regression hold.
To check the homoscedasticity assumption, we plotted standardized residuals against fitted values for both systolic and diastolic blood pressure.
# fitted values (need a constant term for intercept)
model_fitted_y_sys_bp = model_fit_sys_bp.fittedvalues
# model residuals
model_resid_sys_bp = model_fit_sys_bp.resid
# standardized residuals
model_std_resid_sys_bp = model_fit_sys_bp.get_influence().resid_studentized_internal
fig, ax = plt.subplots()
sns.residplot(x = model_fitted_y_sys_bp, y = model_std_resid_sys_bp, lowess = True,
              scatter_kws = {'alpha': 0.5},
              line_kws = {'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('standardized residual plot (systolic blood pressure)')
ax.set_xlabel('fitted values')
ax.set_ylabel('standardized residuals')
# fitted values
model_fitted_y_dia_bp = model_fit_dia_bp.fittedvalues
# model residuals
model_resid_dia_bp = model_fit_dia_bp.resid
# standardized residuals
model_std_resid_dia_bp = model_fit_dia_bp.get_influence().resid_studentized_internal
fig, ax = plt.subplots()
sns.residplot(x = model_fitted_y_dia_bp, y = model_std_resid_dia_bp, lowess = True,
              scatter_kws = {'alpha': 0.5},
              line_kws = {'color': 'red', 'lw': 1, 'alpha': 0.8})
plt.title('standardized residual plot (diastolic blood pressure)')
ax.set_xlabel('fitted values')
ax.set_ylabel('standardized residuals')
From the standardized residual plot for systolic blood pressure, the residuals were roughly centered at zero with approximately constant variance, and the same held for diastolic blood pressure. Therefore, the homoscedasticity assumption can be considered satisfied.
To check the linearity assumption, we plotted partial regression plots of both systolic and diastolic blood pressure against each of the independent variables.
fig = plt.figure(figsize=(8, 12))
plot_partregress_grid(model_fit_sys_bp, fig=fig)
plt.show()
fig = plt.figure(figsize=(8, 12))
plot_partregress_grid(model_fit_dia_bp, fig=fig)
plt.show()
From the partial regression plots for systolic blood pressure, the expected value of the dependent variable appeared to be a linear function of each independent variable, controlling for all the other variables. The same conclusion holds for the partial regression plots of diastolic blood pressure. Therefore, the linearity assumption can be considered satisfied.
To check for multicollinearity, we computed the Pearson correlations between each pair of continuous independent variables and between each pair of binary independent variables.
continuous_vars = full_dt[['age', 'bmi', 'sleep', 'avg_alcohol_freq_wk']]
corr_continuous = np.corrcoef(continuous_vars.values.T)
# corr_continuous
fig, ax = plt.subplots(figsize=(12, 10))
sns.heatmap(corr_continuous, annot=True, square=True, annot_kws = {"size": 12})
plt.title("correlations between continuous variables", fontsize = 16)
ax.set_xticklabels(continuous_vars.columns)
ax.set_yticklabels(continuous_vars.columns, va = 'center')
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
categorical_vars = full_dt[['gender', 'smoke', 'workhrs']]
categorical_vars.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson')
As shown in the heatmap, the correlations between the continuous independent variables are very small or nearly zero. Likewise, the table of Pearson correlations between the factorized binary variables shows little collinearity between each pair of binary variables.
Here, we fit logistic regressions (as an additional analysis) to examine the association between the predictors and the risk of having an abnormal blood pressure level. Following the guideline from the American Heart Association, abnormal systolic blood pressure was coded as 1 if the respondent's systolic blood pressure was 140 mm Hg or higher, and abnormal diastolic blood pressure was coded as 1 if the diastolic blood pressure was 90 mm Hg or higher; otherwise, the outcome was coded as 0 to indicate a normal blood pressure level.
# convert systolic and diastolic blood pressure into binary variables
full_dt['bin_avg_BPXSY'] = np.where(full_dt.avg_BPXSY >= 140, 1, 0)
full_dt['bin_avg_BPXDI'] = np.where(full_dt.avg_BPXDI >= 90, 1, 0)
# drop the variables avg_BPXSY and avg_BPXDI which are not needed for the logistic models
add_analysis_dt = full_dt.drop(['id', 'race', 'avg_BPXSY', 'avg_BPXDI'], axis = 1)
add_analysis_dt.head()
endog_sys_bp = add_analysis_dt.bin_avg_BPXSY
exog_sys_bp = sm.add_constant(add_analysis_dt.iloc[:, 0:7])
logit_model_sys_bp = sm.Logit(endog_sys_bp, exog_sys_bp)
logit_model_fit_sys_bp = logit_model_sys_bp.fit()
print(logit_model_fit_sys_bp.summary())
For systolic blood pressure, the logistic regression model indicated that working overtime is associated with an increased risk of abnormal systolic blood pressure ($\beta = 0.24$, p = 0.432), but this association was not statistically significant. Among the covariates, higher age and higher BMI were significantly associated with an increased risk of abnormal systolic blood pressure.
endog_dia_bp = add_analysis_dt.bin_avg_BPXDI
exog_dia_bp = sm.add_constant(add_analysis_dt.iloc[:, 0:7])
logit_model_dia_bp = sm.Logit(endog_dia_bp, exog_dia_bp)
logit_model_fit_dia_bp = logit_model_dia_bp.fit()
print(logit_model_fit_dia_bp.summary())
For diastolic blood pressure, the logistic regression model showed that working overtime is associated with an increased risk of abnormal diastolic blood pressure ($\beta = 0.41$, p = 0.270), but this association was not statistically significant. Among the covariates, higher age and higher BMI were significantly associated with an increased risk of abnormal diastolic blood pressure.
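Since logistic regression coefficients are on the log-odds scale, exponentiating them gives the more interpretable odds ratios. A small sketch using the overtime coefficients and standard errors reported in the output above (odds_ratio_ci is a helper defined here for illustration):

```python
import numpy as np

# overtime coefficients (log-odds) and standard errors from the fitted models above
beta_sbp, se_sbp = 0.23889, 0.30418
beta_dbp, se_dbp = 0.41364, 0.37478

def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio with a 95% Wald confidence interval."""
    return np.exp(beta), np.exp(beta - z * se), np.exp(beta + z * se)

or_sbp = odds_ratio_ci(beta_sbp, se_sbp)  # roughly 1.27, CI spanning 1
or_dbp = odds_ratio_ci(beta_dbp, se_dbp)  # roughly 1.51, CI spanning 1
print(or_sbp, or_dbp)
```

Both confidence intervals include 1, which is consistent with the non-significant p-values reported above.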
Hayashi, T., Kobayashi, Y., Yamaoka, K., & Yano, E. (1996). Effect of overtime work on 24-hour ambulatory blood pressure. Journal of Occupational and Environmental Medicine, 38(10), 1007-1011.
Lauer, R. M., Mahoney, L. T., Clarke, W. R., & Witt, J. (1993). Childhood predictors for high adult blood pressure: the Muscatine Study. Pediatric Clinics of North America, 40(1), 23-40.
Nakamura, K., Sakurai, M., Morikawa, Y., Miura, K., Ishizaki, M., Kido, T., … & Nakagawa, H. (2012). Overtime work and blood pressure in normotensive Japanese male workers. American Journal of Hypertension, 25(9), 979-985.
Nishiyama, K., & Johnson, J. V. (1997). Karoshi–death from overwork: occupational health consequences of Japanese production management. International Journal of Health Services, 27(4), 625-641.
Papathanasiou, G., Zerva, E., Zacharis, I., Papandreou, M., Papageorgiou, E., Tzima, C., … & Evangelou, A. (2015). Association of high blood pressure with body mass index, smoking and physical activity in healthy young adults. The Open Cardiovascular Medicine Journal, 9, 5.